library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5 ✓ purrr 0.3.4
✓ tibble 3.1.6 ✓ dplyr 1.0.8
✓ tidyr 1.2.0 ✓ stringr 1.4.0
✓ readr 2.0.2 ✓ forcats 0.5.1
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
library(lubridate)
Attaching package: ‘lubridate’
The following objects are masked from ‘package:base’:
date, intersect, setdiff, union
library(janitor)
Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’:
chisq.test, fisher.test
library(broom)
library(modelr)
Attaching package: ‘modelr’
The following object is masked from ‘package:broom’:
bootstrap
library(caret)
Loading required package: lattice
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Attaching package: ‘caret’
The following object is masked from ‘package:purrr’:
lift
library(leaps)
library(GGally)
library(ggfortify)
raw_avocado <- read_csv("data/avocado.csv")
New names:
* `` -> ...1
Rows: 18249 Columns: 14
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (2): type, region
dbl (11): ...1, AveragePrice, Total Volume, 4046, 4225, 4770, Total Bags, Small Bags, Large Bags, XLarge Bags, year
date (1): Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We’ve looked at a few different ways in which we can build models this week, including how to prepare them properly. This weekend we’ll build a multiple linear regression model on a dataset which will need some preparation. The data can be found in the data folder, along with a data dictionary.
We want to investigate the avocado dataset, and, in particular, to model the AveragePrice of the avocados. Use the tools we’ve worked with this week in order to prepare your dataset and find appropriate predictors. Once you’ve built your model use the validation techniques discussed on Wednesday to evaluate it. Feel free to focus either on building an explanatory or a predictive model, or both if you are feeling energetic!
As part of the MVP we want you not to just run the code but also have a go at interpreting the results and write your thinking in comments in your script.
Hints and tips
region may lead to many dummy variables. Think carefully about whether to include this variable or not (there is no one ‘right’ answer to this!)
Think about whether each variable is categorical or numerical. If categorical, make sure that the variable is represented as a factor.
We will not treat this data as a time series, so Date will not be needed in your models, but can you extract any useful features out of Date before you discard it?
If you want to build a predictive model, consider using either leaps or glmulti to help with this.
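The brief asks for the finished model to be validated but does not name a technique here. As one possible approach (an assumption, not necessarily the method taught in class), the sketch below runs 5-fold cross-validation in base R on simulated data. The data frame sim, its columns and the fold count are all hypothetical stand-ins, not the avocado data.

```r
# Simulated data: a hypothetical stand-in, NOT the avocado dataset
set.seed(42)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - 0.5 * sim$x2 + rnorm(n, sd = 0.3)

# Randomly assign each row to one of k folds
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, then measure RMSE on the held-out fold
rmse_per_fold <- sapply(seq_len(k), function(i) {
  train <- sim[fold != i, ]
  test  <- sim[fold == i, ]
  fit   <- lm(y ~ x1 + x2, data = train)
  sqrt(mean((test$y - predict(fit, newdata = test))^2))
})

# Average out-of-sample error across the folds
mean(rmse_per_fold)
```

Comparing this averaged error between candidate models gives a fairer picture than in-sample R-squared, which can only improve as predictors are added.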
summary(raw_avocado)
...1 Date AveragePrice Total Volume 4046 4225 4770
Min. : 0.00 Min. :2015-01-04 Min. :0.440 Min. : 85 Min. : 0 Min. : 0 Min. : 0
1st Qu.:10.00 1st Qu.:2015-10-25 1st Qu.:1.100 1st Qu.: 10839 1st Qu.: 854 1st Qu.: 3009 1st Qu.: 0
Median :24.00 Median :2016-08-14 Median :1.370 Median : 107377 Median : 8645 Median : 29061 Median : 185
Mean :24.23 Mean :2016-08-13 Mean :1.406 Mean : 850644 Mean : 293008 Mean : 295155 Mean : 22840
3rd Qu.:38.00 3rd Qu.:2017-06-04 3rd Qu.:1.660 3rd Qu.: 432962 3rd Qu.: 111020 3rd Qu.: 150207 3rd Qu.: 6243
Max. :52.00 Max. :2018-03-25 Max. :3.250 Max. :62505647 Max. :22743616 Max. :20470573 Max. :2546439
Total Bags Small Bags Large Bags XLarge Bags type year region
Min. : 0 Min. : 0 Min. : 0 Min. : 0.0 Length:18249 Min. :2015 Length:18249
1st Qu.: 5089 1st Qu.: 2849 1st Qu.: 127 1st Qu.: 0.0 Class :character 1st Qu.:2015 Class :character
Median : 39744 Median : 26363 Median : 2648 Median : 0.0 Mode :character Median :2016 Mode :character
Mean : 239639 Mean : 182195 Mean : 54338 Mean : 3106.4 Mean :2016
3rd Qu.: 110783 3rd Qu.: 83338 3rd Qu.: 22029 3rd Qu.: 132.5 3rd Qu.:2017
Max. :19373134 Max. :13384587 Max. :5719097 Max. :551693.7 Max. :2018
We have 18249 rows and 14 variables.
# Clean names
raw_avocado <- raw_avocado %>%
  clean_names()
# read_csv() already parsed Date as a date, so ymd() is only a safeguard here
raw_avocado <- raw_avocado %>%
  mutate(date = ymd(date))
# Add in a month column
raw_avocado <- raw_avocado %>%
  mutate(month = month(date, label = TRUE, abbr = FALSE))
raw_avocado %>%
  group_by(month) %>%
  summarise(count = n())
Perhaps group the months into quarters.
# Add in a quarter column
raw_avocado <- raw_avocado %>%
  mutate(quarter = quarter(date))
# Box plot comparing price by type (conventional vs organic)
ggplot(raw_avocado, aes(x = as.factor(type), y = average_price)) +
  geom_boxplot(fill = "slateblue", alpha = 0.2) +
  xlab("type")
Organic avocados tend to be priced higher than conventional ones, so type looks like a useful predictor.
# Simple line graphs of x4225, x4046 and x4770 against average price
ggplot(raw_avocado, aes(x = average_price)) +
  geom_line(aes(y = x4225), color = "orange", alpha = 0.4) +
  geom_line(aes(y = x4046), color = "darkred", alpha = 0.4) +
  geom_line(aes(y = x4770), color = "steelblue", alpha = 0.4)
Doesn’t really tell us much - but we get an idea of the shape of the data.
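regions <- raw_avocado %>%
  group_by(region) %>%
  summarise(count = n())
There are 54 regions with roughly the same number of observations each (18249 rows do not divide evenly by 54, so the counts cannot all be identical). For modelling this could be a problem, since region alone would need 53 dummy variables, but perhaps we can find one or two regions that are key for driving up prices.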
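To make the dummy-variable cost concrete, here is a minimal sketch that counts the columns a 54-level factor adds to a design matrix. The toy data frame and its region labels are hypothetical placeholders, used only so the example runs on its own.

```r
# Hypothetical stand-in for the region column: 54 labels, 3 rows each
toy <- data.frame(region = factor(rep(paste0("region_", 1:54), each = 3)))

# model.matrix() builds the same design matrix that lm() would use
design <- model.matrix(~ region, data = toy)

# One column is the intercept; the rest are region dummies
n_dummies <- ncol(design) - 1
n_dummies
```

With the default treatment contrasts one level acts as the baseline, so a factor with k levels contributes k - 1 coefficients to the model.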
Perhaps we should look at some simple stats per region.
regions <- raw_avocado %>%
  group_by(region) %>%
  summarise(count = n(), mean_price = mean(average_price),
            mean_x4046 = mean(x4046), mean_x4225 = mean(x4225),
            mean_x4770 = mean(x4770))
regions
raw_avocado %>%
  ggplot(aes(x = average_price, y = region)) +
  geom_boxplot()
Phew - what a mess.
Let’s rotate it.
raw_avocado %>%
  ggplot(aes(x = region, y = average_price)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Still an ugly graph, but it gives us a glimpse of the variation in price between regions, so perhaps region is important after all.
# Tidy up variables
# Remove row count, date and month
avocado_trim <- raw_avocado %>%
  select(-c(x1, date, month))
alias(lm(average_price ~ ., data = avocado_trim))
Model :
average_price ~ total_volume + x4046 + x4225 + x4770 + total_bags +
small_bags + large_bags + x_large_bags + type + year + region +
quarter
Looks like we have no aliased variables - we are good to go
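For comparison, this is roughly what alias() reports when a predictor is an exact linear combination of others. The sketch uses simulated data with hypothetical column names (small, large, total, price). Note that alias() flags only exact dependence, so strongly correlated but not perfectly dependent predictors would still pass this check.

```r
# Simulated data with hypothetical columns, NOT the avocado dataset
set.seed(1)
toy <- data.frame(small = runif(50), large = runif(50))
toy$total <- toy$small + toy$large            # exact linear combination
toy$price <- 1 + toy$small + rnorm(50, sd = 0.1)

fit <- lm(price ~ small + large + total, data = toy)
aliased <- alias(fit)

# Non-NULL here: 'total' is fully determined by 'small' and 'large'
aliased$Complete

# lm() cannot estimate the aliased coefficient, so it is reported as NA
coef(fit)["total"]
```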
# This causes errors because of the regions
avocado_trim %>%
  GGally::ggpairs()
Error in stop_if_high_cardinality(data, columns, cardinality_threshold) :
Column 'region' has more levels (54) than the threshold (15) allowed.
Please remove the column or increase the 'cardinality_threshold' parameter. Increasing the cardinality_threshold may produce long processing times
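A quick way to see in advance which columns will trip this check is to count the distinct values in each non-numeric column. The sketch below does that in base R on a hypothetical toy data frame. The threshold of 15 is taken from the error message above, and the column names are placeholders.

```r
# Hypothetical stand-in for avocado_trim: one low- and one high-cardinality column
toy <- data.frame(
  type   = rep(c("conventional", "organic"), times = 27),
  region = paste0("region_", 1:54),
  price  = seq(1, 2, length.out = 54),
  stringsAsFactors = FALSE
)

threshold <- 15  # the limit quoted in the error message

# Count distinct values in every non-numeric column
non_numeric <- toy[!vapply(toy, is.numeric, logical(1))]
n_levels <- vapply(non_numeric, function(x) length(unique(x)), integer(1))

# Columns that would need to be dropped, lumped or plotted separately
too_many <- names(n_levels)[n_levels > threshold]
too_many
```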
# Split into numeric and non-numeric columns so ggpairs() can handle each set separately
avocado_trim_numeric <- avocado_trim %>%
  select(where(is.numeric))
avocado_trim_nonnumeric <- avocado_trim %>%
  select(!where(is.numeric))
# Keep the response alongside the non-numeric predictors
avocado_trim_nonnumeric$average_price <- avocado_trim$average_price
ggpairs(avocado_trim_numeric)
ggpairs(avocado_trim_nonnumeric)
Error in stop_if_high_cardinality(data, columns, cardinality_threshold) :
Column 'region' has more levels (54) than the threshold (15) allowed.
Please remove the column or increase the 'cardinality_threshold' parameter. Increasing the cardinality_threshold may produce long processing times
Some observations: region continues to cause problems, so we need to rethink how to handle it. quarter is being treated as numeric rather than categorical, so it needs recoding.
# Recode quarter as a categorical label ("Q1" to "Q4")
avocado_trim <- avocado_trim %>%
  mutate(quarter = str_c("Q", quarter))
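The recode above produces a character column. lm() converts character predictors to factors automatically, but the hints ask for categorical variables to be stored as factors, and an explicit factor also fixes the level order. A minimal base R sketch of that idea follows, using three example dates chosen purely for illustration.

```r
# Three illustrative dates (hypothetical input, not read from the data file)
dates <- as.Date(c("2015-01-04", "2016-08-14", "2018-03-25"))

# Month number -> quarter number (1 to 4)
q_num <- (as.integer(format(dates, "%m")) - 1) %/% 3 + 1

# Store as a factor with all four levels declared, even if some are unused
q_fac <- factor(paste0("Q", q_num), levels = paste0("Q", 1:4))
q_fac
```

Declaring all four levels up front keeps the coding stable if a subset of the data happens to miss a quarter.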
# Remove regions
avocado_trim_nr <- avocado_trim %>%
  select(-region)
# Attempt two
avocado_trim_numeric <- avocado_trim_nr %>%
  select(where(is.numeric))
avocado_trim_nonnumeric <- avocado_trim_nr %>%
  select(!where(is.numeric))
avocado_trim_nonnumeric$average_price <- avocado_trim_nr$average_price
ggpairs(avocado_trim_numeric)
plot: [10,10] [========================================================]100% est: 0s
ggpairs(avocado_trim_nonnumeric)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Non-numeric: type is definitely a key variable, and quarter has some influence.
Numeric correlations with average price: year 0.093, xl bags -0.118, large bags -0.173, x4225 -0.173, small bags -0.175, total bags -0.177, x4770 -0.179, total volume -0.193, x4046 -0.208.
The three strongest correlations (by absolute value): x4046 -0.208, total volume -0.193, x4770 -0.179.
Use exhaustive subset selection to identify key variables
# exhaustive selection
regsubsets_exhaustive <- regsubsets(average_price ~ .,
                                    data = avocado_trim_nr,
                                    nvmax = 8, # maximum size of subsets
                                    method = "exhaustive")
sum_regsubsets_exhaustive <- summary(regsubsets_exhaustive)
sum_regsubsets_exhaustive
Subset selection object
Call: regsubsets.formula(average_price ~ ., data = avocado_trim_nr,
nvmax = 8, method = "exhaustive")
13 Variables (and intercept)
Forced in Forced out
total_volume FALSE FALSE
x4046 FALSE FALSE
x4225 FALSE FALSE
x4770 FALSE FALSE
total_bags FALSE FALSE
small_bags FALSE FALSE
large_bags FALSE FALSE
x_large_bags FALSE FALSE
typeorganic FALSE FALSE
year FALSE FALSE
quarterQ2 FALSE FALSE
quarterQ3 FALSE FALSE
quarterQ4 FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
total_volume x4046 x4225 x4770 total_bags small_bags large_bags x_large_bags
1 ( 1 ) " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " " " " " " "
5 ( 1 ) " " " " " " " " " " " " " " " "
6 ( 1 ) " " "*" "*" " " " " " " " " " "
7 ( 1 ) " " "*" "*" " " " " " " " " " "
8 ( 1 ) "*" " " "*" " " " " "*" " " " "
typeorganic year quarterQ2 quarterQ3 quarterQ4
1 ( 1 ) "*" " " " " " " " "
2 ( 1 ) "*" " " " " "*" " "
3 ( 1 ) "*" " " " " "*" "*"
4 ( 1 ) "*" "*" " " "*" "*"
5 ( 1 ) "*" "*" "*" "*" "*"
6 ( 1 ) "*" "*" " " "*" "*"
7 ( 1 ) "*" "*" "*" "*" "*"
8 ( 1 ) "*" "*" "*" "*" "*"
sum_regsubsets_exhaustive$which
(Intercept) total_volume x4046 x4225 x4770 total_bags small_bags large_bags
1 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
3 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
4 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
5 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
6 TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
7 TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
8 TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
x_large_bags typeorganic year quarterQ2 quarterQ3 quarterQ4
1 FALSE TRUE FALSE FALSE FALSE FALSE
2 FALSE TRUE FALSE FALSE TRUE FALSE
3 FALSE TRUE FALSE FALSE TRUE TRUE
4 FALSE TRUE TRUE FALSE TRUE TRUE
5 FALSE TRUE TRUE TRUE TRUE TRUE
6 FALSE TRUE TRUE FALSE TRUE TRUE
7 FALSE TRUE TRUE TRUE TRUE TRUE
8 FALSE TRUE TRUE TRUE TRUE TRUE
plot(regsubsets_exhaustive, scale = "adjr2")
plot(regsubsets_exhaustive, scale = "bic")
plot(sum_regsubsets_exhaustive$rsq, type = "b")
Interestingly, there is no elbow in the R^2 plot, so there is no clear point at which to stop adding predictors.
plot(sum_regsubsets_exhaustive$bic, type = "b")
summary(regsubsets_exhaustive)$which[6,]
(Intercept) total_volume x4046 x4225 x4770 total_bags
TRUE FALSE TRUE TRUE FALSE FALSE
small_bags large_bags x_large_bags typeorganic year quarterQ2
FALSE FALSE FALSE TRUE TRUE FALSE
quarterQ3 quarterQ4
TRUE TRUE
Exhaustive selection suggests that the key variables, in order of entry, are: type (organic), quarter (Q3), quarter (Q4), year, quarter (Q2).
Average price is the response variable.
Average price = 1.158 + (0.496 x Organic(type))
If an avocado is organic, its predicted average price is 0.496 higher than that of a conventional avocado. Type is the only predictor in this model, so there are no other variables to hold constant.
The p-value is less than 0.05, so the effect of type is statistically significant. The R^2 value tells us that 37.9% of the variation in average price is accounted for by whether the avocado is organic.
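To check my reading of a model with a single two-level predictor, here is a minimal sketch on simulated data. The data frame `toy` and its values are made up for illustration and are not the avocado data. It shows that the intercept is the mean of the reference level and the `typeorganic` coefficient is the difference between the two group means.

```r
# Simulated stand-in for the avocado data (values are invented for illustration)
set.seed(1)
toy <- data.frame(type = rep(c("conventional", "organic"), each = 50))
toy$average_price <- 1.16 + 0.50 * (toy$type == "organic") + rnorm(100, sd = 0.3)

toy_model <- lm(average_price ~ type, data = toy)

mean_conventional <- mean(toy$average_price[toy$type == "conventional"])
mean_organic <- mean(toy$average_price[toy$type == "organic"])

# Intercept = mean price of the reference level (conventional)
coef(toy_model)[["(Intercept)"]]
mean_conventional

# typeorganic = difference between the organic and conventional means
coef(toy_model)[["typeorganic"]]
mean_organic - mean_conventional

# R^2 = share of the price variation explained by type alone
summary(toy_model)$r.squared
```

So the 0.496 in model1a can be read as the gap between the average organic price and the average conventional price.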
Before we accept this as our first variable, let’s check our second candidate predictor, quarter (model1b).
Average price is the response variable.
For Quarter 3 Average price = 1.30660 + (0.20631 x Quarter3)
In quarter 3 the predicted average price is 0.206 higher than in quarter 1, the reference level.
The p-value is less than 0.05, so this is statistically significant. The R^2 value tells us that only 4% of the variation in average price is accounted for by quarter.
Model1a is definitely a better model than Model1b, so let’s choose type for the first variable.
Now we need to rerun the analysis to determine the next variable
avocado_rem_resid <- avocado_trim_nr %>%
add_residuals(model1a) %>%
select(-c("average_price", "type"))
ggpairs(avocado_rem_resid)
Correlation coefficients (with resid): total volume -0.063, x4046 -0.088, x4225 -0.038, x4770 -0.064, total bags -0.055, small bags -0.049, large bags -0.069, xl bags -0.012, year 0.118.
Quarter shows some variation. Do I need to make this a dummy?
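The residual step can be sketched in base R as follows. This is an illustration on simulated data (the `toy` data frame and its columns are assumptions, not the avocado data): fit the first model, keep its residuals, then see which remaining numeric predictors correlate with what is left unexplained. For an `lm` fit, `resid()` gives the same values that `add_residuals()` stores in the `resid` column.

```r
# Simulated stand-in data (invented values, for illustration only)
set.seed(2)
n <- 200
toy <- data.frame(
  type = rep(c("conventional", "organic"), times = n / 2),
  volume = rnorm(n),
  year = sample(2015:2018, n, replace = TRUE)
)
toy$average_price <- 1.2 + 0.5 * (toy$type == "organic") +
  0.04 * (toy$year - 2015) + rnorm(n, sd = 0.2)

# Step 1: fit the model with the first chosen predictor
first_model <- lm(average_price ~ type, data = toy)

# Step 2: keep what the model could not explain
toy$resid <- resid(first_model)

# Step 3: correlate each remaining numeric predictor with the residuals
resid_cors <- sapply(toy[c("volume", "year")], cor, y = toy$resid)
resid_cors

# The residuals are uncorrelated with the predictor already in the model
cor(toy$resid, as.numeric(toy$type == "organic"))
```

The last line is why type is dropped before the next round: it has nothing more to contribute to the residuals.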
# exhaustive selection
regsubsets_exhaustive2 <- regsubsets(resid ~ .,
                                     data = avocado_rem_resid,
                                     nvmax = 8, # maximum size of subsets
                                     method = "exhaustive")
sum_regsubsets_exhaustive2 <- summary(regsubsets_exhaustive2)
sum_regsubsets_exhaustive2
Subset selection object
Call: regsubsets.formula(resid ~ ., data = avocado_rem_resid, nvmax = 8,
method = "exhaustive")
12 Variables (and intercept)
Forced in Forced out
total_volume FALSE FALSE
x4046 FALSE FALSE
x4225 FALSE FALSE
x4770 FALSE FALSE
total_bags FALSE FALSE
small_bags FALSE FALSE
large_bags FALSE FALSE
x_large_bags FALSE FALSE
year FALSE FALSE
quarterQ2 FALSE FALSE
quarterQ3 FALSE FALSE
quarterQ4 FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
total_volume x4046 x4225 x4770 total_bags small_bags large_bags x_large_bags
1 ( 1 ) " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " " " " " " "
5 ( 1 ) " " "*" "*" " " " " " " " " " "
6 ( 1 ) " " "*" "*" " " " " " " " " " "
7 ( 1 ) "*" " " "*" " " " " "*" " " " "
8 ( 1 ) "*" "*" " " "*" " " " " "*" " "
year quarterQ2 quarterQ3 quarterQ4
1 ( 1 ) " " " " "*" " "
2 ( 1 ) " " " " "*" "*"
3 ( 1 ) "*" " " "*" "*"
4 ( 1 ) "*" "*" "*" "*"
5 ( 1 ) "*" " " "*" "*"
6 ( 1 ) "*" "*" "*" "*"
7 ( 1 ) "*" "*" "*" "*"
8 ( 1 ) "*" "*" "*" "*"
Top variables are Q3, Q4, Year, Q2
I tried to put region back in, but it is still throwing errors. I may test it anyway.
So - let’s compare quarter, year and region and see which works best
# model 2a - using quarter as the variable
# bringing back in the original dataset with regions
model2a <- lm(average_price ~ type + quarter, data = avocado_trim)
model2a
Call:
lm(formula = average_price ~ type + quarter, data = avocado_trim)
Coefficients:
(Intercept) typeorganic quarterQ2 quarterQ3 quarterQ4
1.05863 0.49596 0.06855 0.20631 0.15204
Average price is the response variable.
For Quarter 3: Average price = 1.05863 + (0.496 x Organic(type)) + (0.20631 x Quarter3)
If an avocado is organic and in quarter 3, its predicted price is 0.496 + 0.20631 = 0.702 higher than that of a conventional avocado in quarter 1 (the reference levels).
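As a quick worked check of the arithmetic, the prediction for an organic avocado in quarter 3 can be built directly from the model2a coefficients printed above. This is only a sketch: the vector `coefs` is typed in by hand from that output rather than pulled from the fitted model.

```r
# Coefficients copied by hand from the model2a output
coefs <- c(intercept = 1.05863, typeorganic = 0.49596,
           quarterQ2 = 0.06855, quarterQ3 = 0.20631, quarterQ4 = 0.15204)

# Conventional avocado in Q1 (both reference levels): just the intercept
conventional_q1 <- coefs[["intercept"]]

# Organic avocado in Q3: intercept + organic effect + Q3 effect
organic_q3 <- coefs[["intercept"]] + coefs[["typeorganic"]] + coefs[["quarterQ3"]]

round(organic_q3, 3)                    # 1.761
round(organic_q3 - conventional_q1, 3)  # 0.702
```

With the real model object, `predict(model2a, newdata = ...)` should give the same number without the hand-copying.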
summary(model2a)
Call:
lm(formula = average_price ~ type + quarter, data = avocado_trim)
Residuals:
Min 1Q Median 3Q Max
-1.11458 -0.20089 -0.02458 0.18542 1.54687
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.058626 0.004718 224.38 <2e-16 ***
typeorganic 0.495958 0.004543 109.16 <2e-16 ***
quarterQ2 0.068546 0.006282 10.91 <2e-16 ***
quarterQ3 0.206308 0.006281 32.84 <2e-16 ***
quarterQ4 0.152040 0.006237 24.38 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3069 on 18244 degrees of freedom
Multiple R-squared: 0.4193, Adjusted R-squared: 0.4192
F-statistic: 3294 on 4 and 18244 DF, p-value: < 2.2e-16
The p-values are all less than 0.05, so each term is statistically significant. The R^2 value tells us that 41.93% of the variation in average price is accounted for by type and quarter together.
par(mfrow = c(2,2))
plot(model2a)
I am liking the Q-Q here
# model 2b - using year as the variable
# bringing back in the original dataset with regions
model2b <- lm(average_price ~ type + year, data = avocado_trim)
model2b
Call:
lm(formula = average_price ~ type + year, data = avocado_trim)
Coefficients:
(Intercept) typeorganic year
-79.35649 0.49596 0.03993
Going to stop looking at year for now, as lm() is treating it as a numeric variable.
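A small sketch of the difference, on simulated data (the `toy` data frame is invented for illustration). When year is numeric, `lm()` fits a single slope and the intercept is the prediction extrapolated back to year 0, which is why it comes out as a large negative number like -79.36. Wrapping year in `factor()` gives one dummy per non-reference year instead.

```r
# Simulated prices for four years (invented values, for illustration only)
set.seed(3)
toy <- data.frame(year = rep(2015:2018, each = 25))
year_effect <- c(0, -0.05, 0.15, -0.02)
toy$average_price <- 1.4 + year_effect[toy$year - 2014] + rnorm(100, sd = 0.1)

# year as numeric: one slope, intercept extrapolated to year 0
numeric_fit <- lm(average_price ~ year, data = toy)

# year as a factor: one dummy for each year after the reference year
factor_fit <- lm(average_price ~ factor(year), data = toy)

names(coef(numeric_fit))
names(coef(factor_fit))
```

Whether year should be a factor here is a judgment call. With only four years the factor version costs few extra coefficients, but it can no longer describe a trend.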
# model 2c - using region as the variable
# bringing back in the original dataset with regions
model2c <- lm(average_price ~ type + region, data = avocado_trim)
model2c
Call:
lm(formula = average_price ~ type + region, data = avocado_trim)
Coefficients:
(Intercept) typeorganic regionAtlanta
1.313079 0.495912 -0.223077
regionBaltimoreWashington regionBoise regionBoston
-0.026805 -0.212899 -0.030148
regionBuffaloRochester regionCalifornia regionCharlotte
-0.044201 -0.165710 0.045000
regionChicago regionCincinnatiDayton regionColumbus
-0.004260 -0.351834 -0.308254
regionDallasFtWorth regionDenver regionDetroit
-0.475444 -0.342456 -0.284941
regionGrandRapids regionGreatLakes regionHarrisburgScranton
-0.056036 -0.222485 -0.047751
regionHartfordSpringfield regionHouston regionIndianapolis
0.257604 -0.513107 -0.247041
regionJacksonville regionLasVegas regionLosAngeles
-0.050089 -0.180118 -0.345030
regionLouisville regionMiamiFtLauderdale regionMidsouth
-0.274349 -0.132544 -0.156272
regionNashville regionNewOrleansMobile regionNewYork
-0.348935 -0.256243 0.166538
regionNortheast regionNorthernNewEngland regionOrlando
0.040888 -0.083639 -0.054822
regionPhiladelphia regionPhoenixTucson regionPittsburgh
0.071095 -0.336598 -0.196716
regionPlains regionPortland regionRaleighGreensboro
-0.124527 -0.243314 -0.005917
regionRichmondNorfolk regionRoanoke regionSacramento
-0.269704 -0.313107 0.060533
regionSanDiego regionSanFrancisco regionSeattle
-0.162870 0.243166 -0.118462
regionSouthCarolina regionSouthCentral regionSoutheast
-0.157751 -0.459793 -0.163018
regionSpokane regionStLouis regionSyracuse
-0.115444 -0.130414 -0.040710
regionTampa regionTotalUS regionWest
-0.152189 -0.242012 -0.288817
regionWestTexNewMexico
-0.297114
Oh wow!! This will take some analysis - so let’s look at the summary
summary(model2c)
Call:
lm(formula = average_price ~ type + region, data = avocado_trim)
Residuals:
Min 1Q Median 3Q Max
-1.09858 -0.16716 -0.01814 0.14692 1.51320
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.313079 0.014894 88.159 < 2e-16 ***
typeorganic 0.495912 0.004017 123.452 < 2e-16 ***
regionAtlanta -0.223077 0.020871 -10.688 < 2e-16 ***
regionBaltimoreWashington -0.026805 0.020871 -1.284 0.19906
regionBoise -0.212899 0.020871 -10.201 < 2e-16 ***
regionBoston -0.030148 0.020871 -1.444 0.14863
regionBuffaloRochester -0.044201 0.020871 -2.118 0.03421 *
regionCalifornia -0.165710 0.020871 -7.940 2.15e-15 ***
regionCharlotte 0.045000 0.020871 2.156 0.03109 *
regionChicago -0.004260 0.020871 -0.204 0.83826
regionCincinnatiDayton -0.351834 0.020871 -16.857 < 2e-16 ***
regionColumbus -0.308254 0.020871 -14.769 < 2e-16 ***
regionDallasFtWorth -0.475444 0.020871 -22.780 < 2e-16 ***
regionDenver -0.342456 0.020871 -16.408 < 2e-16 ***
regionDetroit -0.284941 0.020871 -13.652 < 2e-16 ***
regionGrandRapids -0.056036 0.020871 -2.685 0.00726 **
regionGreatLakes -0.222485 0.020871 -10.660 < 2e-16 ***
regionHarrisburgScranton -0.047751 0.020871 -2.288 0.02216 *
regionHartfordSpringfield 0.257604 0.020871 12.342 < 2e-16 ***
regionHouston -0.513107 0.020871 -24.584 < 2e-16 ***
regionIndianapolis -0.247041 0.020871 -11.836 < 2e-16 ***
regionJacksonville -0.050089 0.020871 -2.400 0.01641 *
regionLasVegas -0.180118 0.020871 -8.630 < 2e-16 ***
regionLosAngeles -0.345030 0.020871 -16.531 < 2e-16 ***
regionLouisville -0.274349 0.020871 -13.145 < 2e-16 ***
regionMiamiFtLauderdale -0.132544 0.020871 -6.351 2.20e-10 ***
regionMidsouth -0.156272 0.020871 -7.487 7.35e-14 ***
regionNashville -0.348935 0.020871 -16.718 < 2e-16 ***
regionNewOrleansMobile -0.256243 0.020871 -12.277 < 2e-16 ***
regionNewYork 0.166538 0.020871 7.979 1.56e-15 ***
regionNortheast 0.040888 0.020871 1.959 0.05013 .
regionNorthernNewEngland -0.083639 0.020871 -4.007 6.16e-05 ***
regionOrlando -0.054822 0.020871 -2.627 0.00863 **
regionPhiladelphia 0.071095 0.020871 3.406 0.00066 ***
regionPhoenixTucson -0.336598 0.020871 -16.127 < 2e-16 ***
regionPittsburgh -0.196716 0.020871 -9.425 < 2e-16 ***
regionPlains -0.124527 0.020871 -5.966 2.47e-09 ***
regionPortland -0.243314 0.020871 -11.658 < 2e-16 ***
regionRaleighGreensboro -0.005917 0.020871 -0.284 0.77679
regionRichmondNorfolk -0.269704 0.020871 -12.922 < 2e-16 ***
regionRoanoke -0.313107 0.020871 -15.002 < 2e-16 ***
regionSacramento 0.060533 0.020871 2.900 0.00373 **
regionSanDiego -0.162870 0.020871 -7.803 6.35e-15 ***
regionSanFrancisco 0.243166 0.020871 11.651 < 2e-16 ***
regionSeattle -0.118462 0.020871 -5.676 1.40e-08 ***
regionSouthCarolina -0.157751 0.020871 -7.558 4.28e-14 ***
regionSouthCentral -0.459793 0.020871 -22.030 < 2e-16 ***
regionSoutheast -0.163018 0.020871 -7.811 6.00e-15 ***
regionSpokane -0.115444 0.020871 -5.531 3.22e-08 ***
regionStLouis -0.130414 0.020871 -6.248 4.24e-10 ***
regionSyracuse -0.040710 0.020871 -1.951 0.05113 .
regionTampa -0.152189 0.020871 -7.292 3.18e-13 ***
regionTotalUS -0.242012 0.020871 -11.595 < 2e-16 ***
regionWest -0.288817 0.020871 -13.838 < 2e-16 ***
regionWestTexNewMexico -0.297114 0.020918 -14.204 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2713 on 18194 degrees of freedom
Multiple R-squared: 0.5473, Adjusted R-squared: 0.546
F-statistic: 407.4 on 54 and 18194 DF, p-value: < 2.2e-16
Most of the p-values are less than 0.05, but for some regions (for example Chicago and RaleighGreensboro) the p-value is greater than 0.05. These regions are not significantly different in price from the reference region, so their individual coefficients should be read with care.
The R^2 value tells us that 54.73% of the variation in average price is accounted for by type and region together.
par(mfrow = c(2,2))
plot(model2c)
Time to use anova to compare the models:
anova(model1a, model2a)
Analysis of Variance Table
Model 1: average_price ~ type
Model 2: average_price ~ type + quarter
Res.Df RSS Df Sum of Sq F Pr(>F)
1 18247 1836.7
2 18244 1718.2 3 118.54 419.56 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The null hypothesis here is that the two models explain the same amount of response variance. The alternative is that they don’t. In this case we find a p-value less than 0.05, so we reject the null hypothesis and say that the model including quarter is significantly better than the model with type alone!
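To see where the F value comes from, it can be rebuilt by hand from the numbers in the anova table. This is a sketch using the rounded values printed in that table, so it only matches the reported 419.56 approximately.

```r
# Values copied from the anova(model1a, model2a) table
sum_sq   <- 118.54   # drop in RSS when quarter is added
rss_big  <- 1718.2   # RSS of average_price ~ type + quarter
df_extra <- 3        # extra coefficients: quarterQ2, quarterQ3, quarterQ4
df_resid <- 18244    # residual degrees of freedom of the bigger model

# F = (improvement per extra coefficient) / (leftover variance per residual df)
f_stat <- (sum_sq / df_extra) / (rss_big / df_resid)
f_stat

# Upper-tail probability of an F this large under the null hypothesis
p_value <- pf(f_stat, df_extra, df_resid, lower.tail = FALSE)
p_value
```

A large F means the three quarter coefficients reduce the residual sum of squares by far more than three arbitrary coefficients would be expected to.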
However, the model including region is still better overall (R^2 of 0.5473 against 0.4193, and adjusted R^2 of 0.546 against 0.4192), so we choose region over quarter in this case. But perhaps we can include quarter as a third variable?
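Because region adds 53 coefficients and quarter only 3, adjusted R^2 is the fairer comparison. A minimal sketch of the standard adjustment, using the R^2 values and coefficient counts from the two summaries (the helper `adj_r2` is my own name, not a function from any package):

```r
# Adjusted R^2: penalises R^2 for the number of predictors p (excluding the intercept)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 18249  # rows in the avocado data

adj_2a <- adj_r2(0.4193, n, p = 4)   # type + quarter: 1 + 3 coefficients
adj_2c <- adj_r2(0.5473, n, p = 54)  # type + region: 1 + 53 coefficients

round(adj_2a, 4)  # 0.4192
round(adj_2c, 3)  # 0.546
```

With this many rows the penalty is tiny, so region still comes out well ahead of quarter.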
anova(model1a, model2c)
Analysis of Variance Table
Model 1: average_price ~ type
Model 2: average_price ~ type + region
Res.Df RSS Df Sum of Sq F Pr(>F)
1 18247 1836.7
2 18194 1339.4 53 497.26 127.44 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
avocado_rem_resid2 <- avocado_trim %>%
add_residuals(model2c) %>%
select(-c("average_price", "type", "region"))
ggpairs(avocado_rem_resid2)
plot: [7,4] [=================================>------------------------] 58% est: 3s
plot: [7,5] [=================================>------------------------] 59% est: 3s
plot: [7,6] [==================================>-----------------------] 60% est: 3s
plot: [7,7] [==================================>-----------------------] 60% est: 3s
plot: [7,8] [==================================>-----------------------] 61% est: 3s
plot: [7,9] [===================================>----------------------] 62% est: 3s
plot: [7,10] [===================================>---------------------] 63% est: 3s
plot: [7,11] [===================================>---------------------] 64% est: 3s
plot: [8,1] [====================================>---------------------] 64% est: 3s
plot: [8,2] [=====================================>--------------------] 65% est: 3s
plot: [8,3] [=====================================>--------------------] 66% est: 2s
plot: [8,4] [======================================>-------------------] 67% est: 2s
plot: [8,5] [======================================>-------------------] 68% est: 2s
plot: [8,6] [=======================================>------------------] 69% est: 2s
plot: [8,7] [=======================================>------------------] 69% est: 2s
plot: [8,8] [========================================>-----------------] 70% est: 2s
plot: [8,9] [========================================>-----------------] 71% est: 2s
plot: [8,10] [========================================>----------------] 72% est: 2s
plot: [8,11] [========================================>----------------] 73% est: 2s
plot: [9,1] [==========================================>---------------] 74% est: 2s
plot: [9,2] [==========================================>---------------] 74% est: 2s
plot: [9,3] [===========================================>--------------] 75% est: 2s
plot: [9,4] [===========================================>--------------] 76% est: 2s
plot: [9,5] [============================================>-------------] 77% est: 2s
plot: [9,6] [============================================>-------------] 78% est: 2s
plot: [9,7] [=============================================>------------] 79% est: 2s
plot: [9,8] [=============================================>------------] 79% est: 2s
plot: [9,9] [=============================================>------------] 80% est: 1s
plot: [9,10] [=============================================>-----------] 81% est: 1s
plot: [9,11] [==============================================>----------] 82% est: 1s
plot: [10,1] [==============================================>----------] 83% est: 1s `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot: [10,2] [===============================================>---------] 83% est: 1s `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot: [10,3] [===============================================>---------] 84% est: 1s `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot: [10,4] [================================================>--------] 85% est: 1s `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot: [10,5] [================================================>--------] 86% est: 1s `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot: [10,6] [================================================>--------] 87% est: 1s `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot: [10,7] [=================================================>-------] 88% est: 1s `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot: [10,8] [=================================================>-------] 88% est: 1s `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot: [10,9] [==================================================>------] 89% est: 1s `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot: [10,10] [=================================================>------] 90% est: 1s
plot: [10,11] [==================================================>-----] 91% est: 1s
plot: [11,1] [===================================================>-----] 92% est: 1s
plot: [11,2] [====================================================>----] 93% est: 1s
plot: [11,3] [====================================================>----] 93% est: 1s
plot: [11,4] [=====================================================>---] 94% est: 1s
plot: [11,5] [=====================================================>---] 95% est: 0s
plot: [11,6] [======================================================>--] 96% est: 0s
plot: [11,7] [======================================================>--] 97% est: 0s
plot: [11,8] [=======================================================>-] 98% est: 0s
plot: [11,9] [=======================================================>-] 98% est: 0s
plot: [11,10] [=======================================================>] 99% est: 0s `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot: [11,11] [========================================================]100% est: 0s
Coefficients (of the resid):
total volume -0.017
x4046 -0.018
x4225 -0.023
x4770 -0.024
total bags -0.005
small bags -0.005
large bags -0.008
xl bags -0.031
year 0.139 - disregard this
quarter - some variation - do I need to make this a dummy?
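On the dummy question for quarter, here is a minimal sketch using made-up data (the `toy` data frame and its `price` values are simulated purely for illustration and are not taken from the avocado data). It compares treating quarter as a number, which fits a single slope, with wrapping it in `factor()`, which lets `lm()` build the dummy columns itself and estimate a separate level for each quarter.

```r
set.seed(1)

# Simulated stand-in data: 4 quarters with a non-linear pattern in price
toy <- data.frame(quarter = rep(1:4, each = 25))
toy$price <- 1 + c(0.0, 0.3, 0.4, 0.1)[toy$quarter] + rnorm(nrow(toy), sd = 0.1)

# Numeric quarter: intercept plus one slope
mod_numeric <- lm(price ~ quarter, data = toy)

# Factor quarter: intercept plus three dummy coefficients (Q1 is the baseline)
mod_factor <- lm(price ~ factor(quarter), data = toy)

length(coef(mod_numeric))  # 2
length(coef(mod_factor))   # 4

# The numeric model is nested in the factor model, so anova() can compare them
quarter_test <- anova(mod_numeric, mod_factor)
quarter_test
```

If the factor version fits clearly better, that suggests the quarterly pattern is not a straight line, and the factor coding may be the safer choice.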
regsubsets_forward <- regsubsets(average_price ~ .,
                                 data = avocado_trim,
                                 nvmax = 12,
                                 method = "forward")
plot(regsubsets_forward)
# See what's in model
plot(summary(regsubsets_forward)$bic, type = "b")
summary(regsubsets_forward)$which[8, ]
(Intercept) total_volume x4046 x4225 x4770
TRUE FALSE FALSE FALSE FALSE
total_bags small_bags large_bags x_large_bags typeorganic
FALSE FALSE FALSE FALSE TRUE
year regionAtlanta regionBaltimoreWashington regionBoise regionBoston
TRUE FALSE FALSE FALSE FALSE
regionBuffaloRochester regionCalifornia regionCharlotte regionChicago regionCincinnatiDayton
FALSE FALSE FALSE FALSE FALSE
regionColumbus regionDallasFtWorth regionDenver regionDetroit regionGrandRapids
FALSE TRUE FALSE FALSE FALSE
regionGreatLakes regionHarrisburgScranton regionHartfordSpringfield regionHouston regionIndianapolis
FALSE FALSE TRUE TRUE FALSE
regionJacksonville regionLasVegas regionLosAngeles regionLouisville regionMiamiFtLauderdale
FALSE FALSE FALSE FALSE FALSE
regionMidsouth regionNashville regionNewOrleansMobile regionNewYork regionNortheast
FALSE FALSE FALSE TRUE FALSE
regionNorthernNewEngland regionOrlando regionPhiladelphia regionPhoenixTucson regionPittsburgh
FALSE FALSE FALSE FALSE FALSE
regionPlains regionPortland regionRaleighGreensboro regionRichmondNorfolk regionRoanoke
FALSE FALSE FALSE FALSE FALSE
regionSacramento regionSanDiego regionSanFrancisco regionSeattle regionSouthCarolina
FALSE FALSE TRUE FALSE FALSE
regionSouthCentral regionSoutheast regionSpokane regionStLouis regionSyracuse
FALSE FALSE FALSE FALSE FALSE
regionTampa regionTotalUS regionWest regionWestTexNewMexico quarter
FALSE FALSE FALSE FALSE TRUE
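The full TRUE/FALSE row is hard to scan when there are many region levels. As a rough sketch, the selected terms can be pulled out by name, and the model size can be chosen from the BIC values rather than by eye. This uses `mtcars` as a stand-in because `avocado_trim` is not reproduced here; `fit`, `best_size` and `selected` are illustrative names.

```r
library(leaps)

# Stand-in data: mtcars ships with R
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "forward")
fit_summary <- summary(fit)

# Model size (number of predictors) with the lowest BIC
best_size <- which.min(fit_summary$bic)

# Names of the terms that are TRUE in that row of the inclusion matrix
selected <- names(which(fit_summary$which[best_size, ]))
selected

# Coefficients of the model of that size
coef(fit, id = best_size)
```

With a factor such as region, `regsubsets()` selects individual dummy levels rather than the whole variable, which is one reason to follow up with an `anova()` comparison on the full factor.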
# test if we should put regions in
mod_type_year <- lm(average_price ~ type + year, data = avocado_trim)
mod_type_region <- lm(average_price ~ type + year + region, data = avocado_trim)
anova(mod_type_year, mod_type_region)
Analysis of Variance Table
Model 1: average_price ~ type + year
Model 2: average_price ~ type + year + region
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1  18246 1811.0
2  18193 1313.7 53    497.25 129.93 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
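To see where the F value in this table comes from, here is a small sketch that recomputes it by hand from the printed numbers. The inputs are the rounded values shown in the table, so the result matches the reported 129.93 only approximately.

```r
# Values copied from the ANOVA table (type + year vs type + year + region)
rss_small    <- 1811.0   # residual sum of squares, smaller model
rss_big      <- 1313.7   # residual sum of squares, larger model
extra_terms  <- 53       # extra coefficients from the region dummies
resid_df_big <- 18193    # residual degrees of freedom, larger model

# F = (drop in RSS per extra term) / (residual variance of the larger model)
f_stat <- ((rss_small - rss_big) / extra_terms) / (rss_big / resid_df_big)

# Upper-tail probability under the F distribution
p_value <- pf(f_stat, extra_terms, resid_df_big, lower.tail = FALSE)

f_stat
p_value
```

The small p-value says that the 53 region dummies together reduce the residual sum of squares by far more than chance alone would be expected to.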
# test if we should put quarter in
mod_type_year <- lm(average_price ~ type + year, data = avocado_trim)
mod_type_quarter <- lm(average_price ~ type + year + quarter, data = avocado_trim)
anova(mod_type_year, mod_type_quarter)
Analysis of Variance Table
Model 1: average_price ~ type + year
Model 2: average_price ~ type + year + quarter
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1  18246 1811.0
2  18245 1702.4  1    108.58 1163.7 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# now test whether the model with region and quarter differs from the one with just region
mod_type_region_quarter <- lm(average_price ~ type + year + region + quarter, data = avocado_trim)
anova(mod_type_region_quarter, mod_type_region)
Analysis of Variance Table
Model 1: average_price ~ type + year + region + quarter
Model 2: average_price ~ type + year + region
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1  18192 1205.2
2  18193 1313.7 -1   -108.56 1638.8 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
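The negative Df and Sum of Sq in this last table come from passing the larger model first. A short sketch on `mtcars` (a stand-in dataset; `small` and `big` are illustrative names) shows that the order only flips the signs, while the F statistic and p-value stay the same.

```r
# Two nested models on stand-in data
small <- lm(mpg ~ wt, data = mtcars)
big   <- lm(mpg ~ wt + hp, data = mtcars)

small_first <- anova(small, big)   # conventional order: smaller model first
big_first   <- anova(big, small)   # reversed order

small_first$Df[2]  # 1
big_first$Df[2]    # -1

# Same test either way
all.equal(small_first$F[2], big_first$F[2])
all.equal(small_first[["Pr(>F)"]][2], big_first[["Pr(>F)"]][2])
```

Putting the smaller model first keeps the table easier to read, since Df then counts the terms being added.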